[WIP] eval: unified Evaluator + EvalManager refactor#420
Open
eugenevinitsky wants to merge 34 commits into emerge/temp_training from
Conversation
…ation yet)
Adds the new evaluator framework as pure additions:
- `pufferlib/ocean/benchmark/evaluators/`
  - `base.py` — `Evaluator` base class + `EvalResult` dataclass
  - `multi_scenario.py` — `MultiScenarioEvaluator` (replaces `eval_multi_scenarios`)
  - `human_replay.py` — `HumanReplayEvaluator` (replay + `control_sdc_only` loop)
  - `behavior_class.py` — `BehaviorClassEvaluator` (per-class nuPlan suite)
  - `wosac.py` — thin wrapper around the existing `WOSACEvaluator`
- `pufferlib/ocean/benchmark/manager.py`
  - `EvalManager`: section discovery, inheritance chain, clean-eval macro, dotted-key flattening, inline + subprocess dispatch, wandb logging.
Plus tests/test_eval_manager.py for the config-merge logic.
See docs/eval_unification.md for the full design rationale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-discovered evaluator sections, schema A (dotted keys). Each behavior class is its own section inheriting from the `behaviors_defaults` template; gigaflow validation has separate metric-only and render sections at different intervals. `driving_behaviours_eval.ini` deleted — folded into `drive.ini`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
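The dotted-key expansion the schema relies on can be sketched in a few lines. This is a minimal illustration, not the actual manager.py implementation: the function name and the sample path are assumptions; only the `key.split(".")` nesting behavior is what the schema describes.

```python
# Minimal sketch of schema A (dotted-key) expansion, assuming sections
# arrive as flat {key: value} dicts from configparser.
def expand_dotted(flat):
    out = {}
    for key, value in flat.items():
        node = out
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})  # create nested dicts on demand
        node[leaf] = value
    return out

expanded = expand_dotted({
    "type": "behavior_class",
    "env.map_dir": "/data/hard_stop",   # hypothetical path
    "env.num_agents": "50",
})
# → {"type": "behavior_class",
#    "env": {"map_dir": "/data/hard_stop", "num_agents": "50"}}
```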
Removed from pufferl.py (~990 lines): eval(), eval_multi_scenarios(), eval_multi_scenarios_render(), build_eval_overrides(), load_eval_multi_scenarios_config(), _swap_policy_obs_counts(), _render_driving_behaviours(), _export_metrics(), _log_eval_metrics(), verify_scenario_coverage(), verify_scenario_coverage_gigaflow(). Plus the legacy eval block in PuffeRL.evaluate() and the driving_behaviours_eval.ini loader in load_config.

Removed from utils.py (~300 lines): run_human_replay_eval_in_subprocess, run_wosac_eval_in_subprocess, run_driving_behaviours_eval_in_subprocess.

PuffeRL.evaluate() now calls self._eval_manager.maybe_run() — a single unified path for all evals. main() wires `puffer eval --evaluator <name> --out <path>` for both standalone and subprocess use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…EPLAY in c_step

Add GOAL_ADVANCE_REGENERATE / GOAL_ADVANCE_SATURATE constants and a goal_advance_mode field on Drive. c_step's last-goal branch now dispatches on goal_advance_mode instead of simulation_mode. Drive.__init__ accepts a goal_advance_mode kwarg with an auto-pick based on simulation_mode (gigaflow → regenerate, replay → saturate) — the same behavior as before, but the choice is explicit and per-eval-overridable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR refactors PufferDrive’s evaluation system by replacing multiple overlapping eval entry points with a unified Evaluator framework and an EvalManager that discovers [eval.<name>] sections from drive.ini and runs them inline or via subprocess. It also introduces a new C/Python goal_advance_mode knob to decouple last-goal behavior from simulation_mode.
Changes:
- Added `pufferlib/ocean/benchmark/evaluators/` (base `Evaluator` + concrete evaluators) and `pufferlib/ocean/benchmark/manager.py` (discovery, inheritance merge, clean macro, dispatch, logging).
- Refactored the `pufferlib/pufferl.py` eval CLI + training loop to route evaluation through `EvalManager`; removed legacy subprocess eval helpers from `pufferlib/utils.py`.
- Updated `drive.ini` to `[eval.<name>]` suites (with templates + inheritance), removed `driving_behaviours_eval.ini`, and added `goal_advance_mode` through the Python/C bindings.
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `tests/test_eval_manager.py` | New unit tests for eval-section discovery, inheritance merge, dotted-key expansion, and clean-macro behavior. |
| `README.md` | Updates cluster install wording (NYU cluster specifics). |
| `pufferlib/utils.py` | Removes legacy subprocess eval helpers; retains the video rendering helper. |
| `pufferlib/pufferl.py` | Routes eval through `EvalManager`; rewires CLI `puffer eval --evaluator ... --out ...`; removes legacy eval pipelines. |
| `pufferlib/ocean/drive/drive.py` | Adds Python-side `goal_advance_mode` and passes it into C env init kwargs. |
| `pufferlib/ocean/drive/drive.h` | Adds `GOAL_ADVANCE_*` constants + `goal_advance_mode` field; switches last-goal behavior to the new knob in `c_step`. |
| `pufferlib/ocean/drive/binding.c` | Unpacks `goal_advance_mode` into the C env struct. |
| `pufferlib/ocean/benchmark/manager.py` | New `EvalManager` with section discovery, inheritance merge, clean macro, inline/subprocess dispatch, and logging. |
| `pufferlib/ocean/benchmark/evaluators/base.py` | New `Evaluator` base class + `EvalResult`. |
| `pufferlib/ocean/benchmark/evaluators/multi_scenario.py` | New multi-scenario evaluator + optional EGL render pass. |
| `pufferlib/ocean/benchmark/evaluators/human_replay.py` | New human replay evaluator implementation. |
| `pufferlib/ocean/benchmark/evaluators/behavior_class.py` | New behavior-class evaluator (sampling + symlink-dir cleanup). |
| `pufferlib/ocean/benchmark/evaluators/wosac.py` | New WOSAC evaluator adapter around existing realism evaluator logic. |
| `pufferlib/ocean/benchmark/evaluators/__init__.py` | Registers evaluator types for manager lookup. |
| `pufferlib/config/ocean/driving_behaviours_eval.ini` | Deleted (folded into `drive.ini`). |
| `pufferlib/config/ocean/drive.ini` | Replaces `[eval]` with `[eval.<name>]` suites + templates/inheritance; adds WOSAC and behavior-class sections. |
| `.gitignore` | Un-ignores `pufferlib/ocean/benchmark/**` (was previously matched by `benchmark*/`). |
Comment on lines +62 to +68:

```python
obs, _, _, _, info_list = vecenv.step(action_np)
if info_list:
    all_infos.extend(info_list)
# Stop once every bin has yielded one info to avoid double-counting
# on the second cycle through the dir.
if len(all_infos) >= num_maps:
    break
```
Comment on lines +105 to +125:

```python
def _run_inline(self, ev: Evaluator, policy, env_name: str, global_step) -> EvalResult:
    args = self._build_eval_args(ev, env_name=env_name, global_step=global_step)

    package = args.get("package", "ocean")
    module_name = "pufferlib.ocean" if package == "ocean" else f"pufferlib.environments.{package}"
    env_module = importlib.import_module(module_name)
    make_env = env_module.env_creator(env_name)

    vec_kwargs = ev.vec_overrides()
    num_envs = int(vec_kwargs.get("num_envs", 1))
    env_kwargs_list = [args["env"] for _ in range(num_envs)]
    env_creators = [make_env] * num_envs
    env_args_list = [[]] * num_envs

    vec_call_kwargs = dict(vec_kwargs)
    vec_call_kwargs.setdefault("num_workers", num_envs)
    vec_call_kwargs.setdefault("batch_size", num_envs)

    vecenv = pufferlib.vector.make(
        env_creators, env_args=env_args_list, env_kwargs=env_kwargs_list, **vec_call_kwargs
    )
```
Comment on lines +132 to +137:

```python
def _run_subprocess(self, ev: Evaluator, env_name: str, global_step) -> EvalResult:
    out_path = Path(self.train_config.get("data_dir", ".")) / "eval_subprocess_out" / f"{ev.name}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    cfg_path = out_path.with_suffix(".cfg.json")
    with open(cfg_path, "w") as f:
        json.dump({"name": ev.name, "global_step": global_step}, f)
```
Comment on lines +153 to +155:

```python
subprocess.run(cmd, check=True)
with open(out_path) as f:
    payload = json.load(f)
```
```python
eval_args["load_model_path"] = os.path.join(
    self.config["data_dir"], experiment_name, "models", f"inline_epoch_{self.epoch}.pt"
```

```python
logger=self.logger,
global_step=self.global_step,
```
Comment on lines +266 to +284:

```ini
env.map_dir = "/scratch/$USER/data/nuplan/nuplan_mini_train_bins"

[eval.behaviors_hard_stop]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/hard_stop"

[eval.behaviors_highway_straight]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/highway_straight"

[eval.behaviors_lane_change]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/lane_change"
```
Comment on lines 1 to 8:

```python
import os
import sys
import glob
import random
import shutil
import subprocess
import tempfile
import json
```
Comment on lines +71 to +84:

```python
def maybe_run(self, epoch: int, policy, env_name: str, logger=None, global_step=None) -> dict:
    """Called from the training loop. Runs every enabled evaluator
    whose `interval` divides `epoch`. Returns a dict of {eval_name → metrics}."""
    results = {}
    for ev in self.evaluators:
        if not ev.enabled:
            continue
        if ev.interval <= 0:
            continue
        if epoch % ev.interval != 0:
            continue
        res = self._run_one(ev, policy=policy, env_name=env_name, logger=logger, global_step=global_step)
        results[ev.name] = res
    return results
```
Comment on lines +126 to +127:

```python
try:
    res = ev.rollout(vecenv, policy, args)
```
The HumanReplay and MultiScenario rollouts shared ~80% of the same step loop (reset → forward_eval → step → collect infos → aggregate). Move it to the base class as `_run_rollout_loop`. Subclasses override only the hooks they actually need to diverge on:

- `_initial_reset` (sync vs async reset)
- `_maybe_reset_lstm` (per-scenario reset cadence)
- `_should_stop` (termination condition)
- `_flatten_infos` (multi-worker vs single-worker info shape)
- `_aggregate_infos` (per-key mean is the default)
- `_render_pass` (no-op default; MultiScenario implements EGL)

`HumanReplayEvaluator` now overrides only `env_overrides` + `_should_stop`. `BehaviorClassEvaluator` still inherits HumanReplay; it only adds the tmp symlink dir for random sampling. WOSAC keeps its custom rollout — its per-scene multi-rollout loop doesn't fit the default shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
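The template-method shape this commit describes can be sketched as follows. The hook names mirror the commit message, but the bodies, signatures, and the fake vecenv/policy are illustrative assumptions, not the actual base-class code:

```python
class Evaluator:
    """Illustrative base class: default rollout loop plus override hooks."""

    def _initial_reset(self, vecenv):
        return vecenv.reset()                        # sync reset by default

    def _should_stop(self, all_infos, target):
        return len(all_infos) >= target              # termination-condition hook

    def _flatten_infos(self, infos):
        return [i for i in infos if i]               # single-worker default

    def _aggregate_infos(self, all_infos):
        # per-key mean is the default aggregation
        keys = {k for info in all_infos for k in info}
        return {k: sum(i[k] for i in all_infos if k in i)
                   / sum(1 for i in all_infos if k in i) for k in keys}

    def _render_pass(self, args):
        return []                                    # no-op; MultiScenario adds EGL

    def _run_rollout_loop(self, vecenv, policy, target):
        obs = self._initial_reset(vecenv)
        all_infos = []
        while not self._should_stop(all_infos, target):
            obs, infos = vecenv.step(policy(obs))    # forward_eval → step
            all_infos.extend(self._flatten_infos(infos))
        return self._aggregate_infos(all_infos)


class FakeVecEnv:
    """Stand-in env so the loop above runs end to end."""
    def reset(self):
        return 0
    def step(self, action):
        return 0, [{"collision_rate": 0.1}]

metrics = Evaluator()._run_rollout_loop(FakeVecEnv(), lambda obs: obs, target=3)
# metrics["collision_rate"] is the mean over the 3 collected infos
```

A subclass that only needs a different stopping rule overrides `_should_stop` and inherits everything else, which is the HumanReplay case described above.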
The benchmark/** unignore was over-broad — it pulled in pycache files that should never be committed. Add an explicit re-ignore for __pycache__ under that path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `eval.render_num_scenarios`: explicit per-evaluator render budget; defaults to `min(eval.num_scenarios, 3)`. Renders are expensive (mp4 encode + wandb upload) and shouldn't run at metric scale.
- Render path randomizes `starting_map` per epoch when not pinned, so successive renders show different bins from the dir instead of the first N alphabetically. Restores the old behavior from `_render_driving_behaviours` that the refactor lost.
- `behaviors_defaults` pins `render_num_scenarios = 2` so 12 classes × 2 views × 2 scenarios = 48 mp4s/epoch (vs 1200 if it inherited `num_scenarios = 50`).
- Test: `render_num_scenarios` is inheritable.

"Worst N" selection deferred — needs per-scenario scores piped out of the metric rollout into the render call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled fixes:
EvalManager:
- Accepts run_id at construction. Used to locate the per-run
models/ directory.
- latest_checkpoint(env_name) walks data_dir/<env>_<run_id>/models/
for the newest model_*.pt. Falls back to train_config.load_model_path
if no checkpoints exist yet.
- has_subprocess_evals_at(epoch) reports whether any enabled
subprocess evaluator would fire at that epoch.
- _run_subprocess uses latest_checkpoint instead of train_config.load_model_path.
PuffeRL.evaluate:
- Calls save_checkpoint() before maybe_run() if any subprocess
evaluator would fire. Mirrors the old run_driving_behaviours
_eval_in_subprocess flow.
- train() passes logger.run_id when constructing the manager.
For all-inline configs (today's drive.ini default) this is a no-op.
Activates when an evaluator is flipped to mode=subprocess.
Tests: latest_checkpoint picks the newest by ctime, falls back to
load_model_path when no models exist; has_subprocess_evals_at fires
only on enabled subprocess evaluators at matching intervals.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unified-eval refactor doesn't actually need this knob, and pushing it through the C struct + binding + drive.py at the same time bloats this PR. Restoring the original `if (env->simulation_mode == SIMULATION_REPLAY)` branch in c_step. The "promote implicit branches to explicit knobs" audit is real work but lives in its own PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erride stack
Adds 4 tests covering the regression-prone surface the parser-only
tests miss:
test_maybe_run_dispatches_by_interval_and_enabled
Stubs _run_one and verifies that maybe_run fires only enabled
evaluators whose interval divides epoch. epoch=33 fires nothing,
epoch=250 fires both 25-interval and 250-interval evaluators.
test_flatten_infos_handles_shape_variations
_flatten_infos must handle multi-worker (list of lists) AND
PufferEnv (flat list) backends, plus None / empty entries —
one bad isinstance check silently drops episode infos.
test_behavior_class_cleanup_removes_symlink_dir
Builds a real 5-bin map_dir, requests a 2-bin sample, verifies
the tmp symlink dir gets created with 2 bins, then cleanup()
removes it. tempfile.mkdtemp leftovers are a real footgun.
test_eval_args_compose_train_section_and_clean_macro
The full override stack: train baseline → section overrides →
clean macro. Section beats baseline, explicit beats macro,
untouched baseline survives.
25 tests total (was 21).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
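The shape-handling concern behind `test_flatten_infos_handles_shape_variations` can be made concrete with a small sketch. This is illustrative, not the actual method: only the two accepted shapes (list-of-lists vs flat list) and the None/empty handling come from the commit message:

```python
def flatten_infos(raw):
    """Accept multi-worker (list of lists) and PufferEnv (flat list of
    dicts) backend shapes; skip None/empty entries rather than letting
    one bad isinstance check silently drop episode infos."""
    flat = []
    for entry in raw or []:
        if not entry:
            continue                            # None or empty dict/list
        if isinstance(entry, dict):
            flat.append(entry)                  # flat PufferEnv backend
        else:
            flat.extend(i for i in entry if i)  # multi-worker list of lists
    return flat

flatten_infos([[{"a": 1}, None], [{"b": 2}]])   # → [{"a": 1}, {"b": 2}]
flatten_infos([{"a": 1}, {}, None])             # → [{"a": 1}]
```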
Aligns with the heavier-eval cadence the rest of the new eval sections use. 25-epoch interval was too aggressive for inline gigaflow validation given the per-pass setup cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Time the rollout in Evaluator.rollout (base class) and inject
`eval_seconds` into the returned metrics dict. Manager's _log
posts it to wandb under {ev.name}/eval_seconds — wall-clock cost
per evaluator becomes a first-class panel.
Refactor WOSACEvaluator to override _run_rollout_loop instead of
rollout — now WOSAC also benefits from the timing without code
duplication.
Test: stub evaluator with a forced 20ms floor; verifies eval_seconds
lands in the result and the inner metrics survive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
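The timing wrapper can be sketched as follows: a hedged illustration of measuring in the base `rollout` so every subclass inherits it. The stub `_run_rollout_loop` here only simulates work; the real one runs the vec-env loop:

```python
import time

class Evaluator:
    def _run_rollout_loop(self, vecenv, policy, args):
        time.sleep(0.02)                  # stand-in for the real rollout
        return {"collision_rate": 0.0}

    def rollout(self, vecenv, policy, args):
        start = time.perf_counter()
        metrics = self._run_rollout_loop(vecenv, policy, args)
        # wall-clock cost becomes a first-class, per-evaluator metric
        metrics["eval_seconds"] = time.perf_counter() - start
        return metrics

result = Evaluator().rollout(None, None, {})
# result carries both the inner metrics and eval_seconds
```

Because the subclass hook returns the metrics dict and the base injects `eval_seconds` into it, subclasses get timing for free, which is the WOSAC refactor's point.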
The eval-mode resample path at drive.py:447 was missing `goal_radius` in its kwargs to `binding.shared`, while the initial-spawn call at line 293 has it. `binding.shared`'s C side (binding.c:1545) requires `goal_radius` via `unpack()`, so the resample crashes with "Missing required keyword argument 'goal_radius'" once a scenario batch completes and a new one is requested.

Latent bug — never triggered because the legacy `[eval]` section had `multi_scenario_eval = False` as the default, so the eval rollout path that triggers resample never fired in production. The new EvalManager flips the `multi_scenario_eval` equivalent (`[eval.validation_gigaflow]`) on by default, surfacing the crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single section now does both: 250-scenario metric pass + 5-scenario render pass via the render_num_scenarios knob. The split made sense before render became a per-section flag; it's redundant now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PufferEnv backend in pufferlib.vector.make treats env_creator as a single callable and passes env_args/env_kwargs to it directly (line 697 in vector.py). Multiprocessing/Serial backends expect lists. Render must use PufferEnv (one ffmpeg pipe per env), so pass a single creator + dict, not lists. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 8-map carla dir doesn't need 250 scenarios — gigaflow's C-side eval already creates one internal env per scenario (capped at num_scenarios) and steps them in a single batched rollout, so num_scenarios=8 covers every map exactly once in parallel. Drop num_agents 512→400 to fill exactly 8 × 50 slots with no wasted capacity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o_length
Render path was falling back to env.scenario_length (3000 = 100s mp4 at
30fps) because render_max_steps wasn't a real config knob. Two fixes:
1. multi_scenario.py: read eval.render_max_steps from args["eval"]
where _build_eval_args puts evaluator-private fields, and default
to 91 (~3 sec) — not scenario_length, which is the metric pass.
2. drive.ini: set eval.render_max_steps = 91 on validation_gigaflow
explicitly so the comment + value document the intent.
EGL render is ~3 fps wall-clock at 1080p, so 91 × 5 × 2 ≈ 5 min/render
pass instead of 28 min when defaulting to scenario_length=3000.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old MultiScenarioEvaluator forced Multiprocessing + async_reset by
default, which (a) added fork/IPC startup cost without throughput gain
at our scale, (b) broke render (one ffmpeg pipe per env requires
single-process), and (c) couldn't construct the vec env at all because
the manager passed the PufferEnv-shaped single creator to the list-shaped
backend dispatch. Three changes:
1. multi_scenario.py: drop the vec_overrides() and _initial_reset()
overrides that pinned Multiprocessing — inherit base-class PufferEnv
+ sync reset.
2. manager.py:_run_inline: branch on backend so Multiprocessing remains
a valid opt-in via [eval.<name>.vec] backend = "Multiprocessing".
Useful for memory-bound replay sweeps, hetero-config evals, or
async overlap on long rollouts — not needed for the 8-map carla
validation eval that's the hot path today.
3. The 8-map validation eval now spins up one Drive with 8 internal
envs in-process, no fork.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path
Two fixes for the multi-evaluator smoke test:
1. multi_scenario._render_view: glob filter excludes mp4s that already
existed in out_dir before this pass. The dir is shared across epochs
and across views, so a bare `out_dir.glob('*.mp4')` was returning
every mp4 from prior render passes too — turning epoch 12510's
16-mp4 render into a 32-mp4 wandb upload (the 16 from epoch 12505
were still on disk).
2. drive.ini: replace `/scratch/$USER/data/nuplan/...` with the literal
path. configparser doesn't expand env vars, and the manager doesn't
either, so behavior evals were crashing on FileNotFoundError trying
to open `/scratch/$USER/data/...`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
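The env-var failure mode is easy to demonstrate: `configparser` leaves `$USER` literal. `os.path.expandvars` is shown as one possible alternative; the PR instead inlined the literal path in drive.ini:

```python
import configparser
import os

cfg = configparser.ConfigParser()
cfg.read_string("[eval]\nmap_dir = /scratch/$USER/data/nuplan\n")
raw = cfg["eval"]["map_dir"]
# configparser's interpolation handles %-style keys only; $USER stays literal,
# so opening this path fails with FileNotFoundError.
expanded = os.path.expandvars(raw)  # would substitute $USER if it is set
```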
Previously `{scenario_id}{view_suffix}.mp4` — each render epoch
overwrote the previous epoch's mp4 in place, making the prior fix's
existing-file snapshot filter incorrect (filter would exclude the
freshly-written file because Path equality matches the prior path).
Now `{scenario_id}_step{N}{view_suffix}.mp4` so:
- successive eval epochs produce distinct mp4s (no overwrites).
- wandb's render carousel shows one entry per epoch, letting the
user watch policy evolve over training.
- the return-paths glob is exact: just files matching this step's
pattern, no snapshot-filtering trickery.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Behavior-class evaluators previously had render=true in config but no
_render_pass implementation, so the videos silently never appeared.
Three changes:
1. base.py: lift _render_pass + _render_view from MultiScenarioEvaluator
into Evaluator. The render loop is generic — fresh PufferEnv with
render_mode=headless, ffmpeg pipe per active env per view, mp4s
stamped with global_step in the filename.
2. base.py: add _render_env_overrides hook so subclasses can tweak the
render env (default = metric env + render_mode=headless). Render
loop now caps internal-env render count at eval.render_num_scenarios
instead of always rendering the full batch (the C kernel still steps
the full batch — just fewer ffmpeg pipes). Drops the dead
batch_size_eval write.
3. multi_scenario.py: keeps the random-starting_map override (its only
real difference from the default render path) — everything else is
now inherited.
Result: behavior_class and human_replay inherit a working render path.
For 12 behavior classes with render_num_scenarios=2, render is bounded
by ~3 fps × 300 steps × 2 views ≈ 3.5 min/class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every evaluator runs at the same global_step, so the previous shared out_dir + step glob made each evaluator's _render_view re-collect every earlier evaluator's mp4s into result.frames. wandb then logged validation_gigaflow's videos under behaviors_*/render too. Per-evaluator subdir (`mp4/<ev_name>/`) keeps each evaluator's render output isolated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… aliasing
Every behavior_class evaluator was rendering the same bin (the LAST
section's map_dir won for all siblings) because _deep_merge mutates
its dst in place, but _expand_dotted's shallow-copy meant the first
level merged into a fresh `merged` dict had `merged["env"]` aliasing
the parent section's env dict directly. The next level's merge then
recursed into that alias and overwrote keys on the parent itself, so:
- building child_a left defaults["env"]["map_dir"] = a's map_dir
- building child_b read the polluted defaults, set map_dir = b's
- … repeated through all siblings — every cfg["env"] is the same
reference, the last write wins for everyone.
Diagnostic showed all 12 behavior evaluators returning map_dir =
unprotected_right (the last section), num_maps=1, num_agents=1.
Fix: deepcopy each level before merging so `merged` owns its own
nested dicts and the parent section dict in `all_sections` stays
pristine across sibling builds.
Plus a regression test that builds three siblings inheriting from
the same parent and asserts each gets its own map_dir AND that the
parent dict is untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Render mp4 filenames now use `_epoch{E}_step{N}{view_suffix}.mp4`
instead of just `_step{N}{view_suffix}.mp4`. Epoch is the human-readable
checkpoint index ("which epoch did this come from"), step is the
precise env-step count. Both are useful — epoch for sorting through
training history at a glance, step for tying back to wandb's x-axis.
Threads `epoch` through maybe_run → _run_one → _run_inline →
_build_eval_args → args["epoch"], where _render_pass picks it up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…enarios=2
The 'Template — no type, never instantiated' inline comment on
[eval.behaviors_defaults] was redundant with the section-header block
above ('All inherit from the template below') and read like internal
prose mid-section.
Also drop the explicit `eval.render_num_scenarios = 2` whose comment
defended pinning lower than the default of 3 — the 48-vs-72 difference
in mp4 count isn't load-bearing, and the contradiction (set 2, comment
defaults 3) was confusing. Inherits the default 3 now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efaults

base.py:_render_view used to default `render_num_scenarios = min(metric_count, 3)` and `render_max_steps = 300` if the section didn't set them. That hides the render-cost knob inside code where readers won't see it. Render is expensive (EGL frame-by-frame at 1080p, ffmpeg encode + wandb upload), and the right budget depends on scenario_length and on how many evaluators are firing — both are config-side concerns.

Now: if `render = true`, the section MUST set both `eval.render_num_scenarios` and `eval.render_max_steps`. Missing either raises KeyError at the first render attempt.

drive.ini: behaviors_defaults already had render=true; add explicit render_num_scenarios=3 and render_max_steps=200 (= the full bin at scenario_length=201). validation_gigaflow already set both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default rollout loop now reads `terminals` and `truncations` from
vecenv.step (previously discarded) and masks the LSTM state per-agent —
either signal means 'episode over, env reset' so we OR them. Agents
whose env truncated/terminated start the next forward with zeroed
hidden state; the rest carry recurrent memory across the step.
This replaces the whole-batch heuristic (`steps % scenario_length == 0`)
that lived in MultiScenarioEvaluator._maybe_reset_lstm. The heuristic
only worked because gigaflow validation runs uniform scenarios with
termination_mode=0, where every env truncated on the same Python tick.
It would silently desync if any env truncated early (e.g.
termination_mode=1, future per-bin scenario_length).
Also fixes the silent correctness bug for behavior_class / human_replay
evaluators, which inherited the base no-op and were never resetting
LSTM state across scenario boundaries. (Latent because rnn_name=None in
the smoke configs we've been running, but it'd bite as soon as RNN is
enabled.)
- base.py: capture terminals/truncations from vecenv.step; mask state
per-agent with `(1 - (term | trunc)).reshape(-1, 1)` before next
forward.
- base.py: drop the `_maybe_reset_lstm` hook (now redundant).
- multi_scenario.py: drop the override (was the only implementer).
- tests: regression test that constructs a 4-agent rollout where 3
agents fire terminate/truncate and one stays alive; asserts only
the alive agent's state survives.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
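The per-agent masking can be sketched with NumPy. Shapes and names are assumptions for illustration; in the real code the state lives in the policy's LSTM buffers:

```python
import numpy as np

num_agents, hidden = 4, 8
state = np.ones((num_agents, hidden), dtype=np.float32)
terminals = np.array([1, 0, 0, 0], dtype=bool)
truncations = np.array([0, 1, 1, 0], dtype=bool)

# Either signal means "episode over, env reset", so OR them, then zero
# the finished agents' hidden state before the next forward pass.
mask = (1 - (terminals | truncations)).reshape(-1, 1)
state = state * mask
# agents 0-2 start the next forward with zeroed state; agent 3 keeps memory
```

The mask broadcasts over the hidden dimension, so each agent's entire hidden vector is either kept or zeroed, never partially reset.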
Each test now opens with a 1-3 line docstring stating what it asserts and why — useful for future readers scanning the file or for failure output to make the intent obvious without re-reading the body. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the five overlapping eval pipelines with one base class + manager. See `docs/eval_unification.md` (gitignored working spec) and the design discussion preceding this PR.

WIP — do not merge yet. Functional integration tested locally on macOS (15 unit + map-type tests green); not yet exercised on the cluster against a real training run.
What changed
`pufferlib/ocean/benchmark/evaluators/` package: `Evaluator` base class + `EvalResult` dataclass + four subclasses (`MultiScenarioEvaluator`, `HumanReplayEvaluator`, `BehaviorClassEvaluator`, `WOSACEvaluator`).

Net diff: ~+1100, ~−1900.
Tests
Test plan
Out of scope
🤖 Generated with Claude Code